Poisson naive Bayes for text classification with feature weighting
نویسندگان
چکیده
In this paper, we investigate the use of multivariate Poisson model and feature weighting to learn naive Bayes text classifier. Our new naive Bayes text classification model assumes that a document is generated by a multivariate Poisson model while the previous works consider a document as a vector of binary term features based on the presence or absence of each term. We also explore the use of feature weighting for the naive Bayes text classification rather than feature selection, which is a quite costly process when a small number of the new training documents are continuously provided. Experimental results on the two test collections indicate that our new model with the proposed parameter estimation and the feature weighting technique leads to substantial improvements compared to the unigram language model classifiers that are known to outperform the original pure naive Bayes text classifiers.
منابع مشابه
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the fast increase of the documents, using Text Document Classification (TDC) methods has become a crucial matter. This paper presented a hybrid model of Invasive Weed Optimization (IWO) and Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the big size of features space in TDC. TDC includes different actions such as text processing, feature extraction, form...
متن کاملA New Fine-Grained Weighting Method in Multi-Label Text Classification
Multi-label classification is one of the important research areas in data mining. In this paper, a new multilabel classification method using multinomial naive Bayes is proposed. We use a new fine-grained weighting method for calculating the weights of feature values in multinomial naive Bayes. Our experiments show that the value weighting method could improve the performance of multinomial nai...
متن کاملWeighted Neural Bag-of-n-grams Model: New Baselines for Text Classification
NBSVM is one of the most popular methods for text classification and has been widely used as baselines for various text representation approaches. It uses Naive Bayes (NB) feature to weight sparse bag-of-n-grams representation. N-gram captures word order in short context and NB feature assigns more weights to those important words. However, NBSVM suffers from sparsity problem and is reported to...
متن کاملSupervised Term Weighting Methods for URL Classification
Many term weighting methods are suggested in the literature for Information Retrieval and Text Categorization. Term weighting method, a part of feature selection process is not yet explored for URL classification problem. We classify a web page using its URL alone without fetching its content and hence URL based classification is faster than other methods. In this study, we investigate the use ...
متن کاملA Comparative Study on Feature Weight in Thai Document Categorization Framework
Text Categorization is the process of automatically assigning predefined categories to free text documents. Feature weighting, which calculates feature (term) values in documents, is one of important preprocessing techniques in text categorization. This paper is a comparative study of feature weighting methods in statistical learning of Thai Document Categorization Framework. Six methods were e...
متن کامل